By statisticians and academics for statisticians and academics

https://stackoverflow.blog/2017/10/10/impressive-growth-r/

Packages to your heart’s desire

There is a plethora of packages that can be downloaded and used:

And they are easy to install!

#CRAN
install_package('tidyverse')
#GitHub
library(devtools)
install_github("gretat/sdt.rmcs")
meme::meme('1200px-Puking_Rainbows.jpg', "Packages", color = 'blue', size = 5, vjust = 0.05)

The Tidy Universe

library(tidyverse)
## -- Attaching packages ---------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Scottish wellbeing and access to green and blue spaces

Lets’ get some data from the Scottish Government at statistics.gov.scot

We want to explore how Scottish people experience mental wellbeing and is it related to how far they are from a park or the sea.

green_blue <- read_csv("Distance to green or blue space- scottish household survey.csv")
wellbeing <- read_csv('Mental wellbeing- Scottish Surveys.csv')
geo_area <- read_csv('geo_area.csv')

Lets have a quick look the acces to green and blue spaces data

There are variety of ways to preview a datases

#the straightforward way
head(green_blue)
## # A tibble: 6 x 13
##   FeatureCode DateCode Measurement Units Value `Distance to Ne~ Age   Gender
##   <chr>          <dbl> <chr>       <chr> <dbl> <chr>            <chr> <chr> 
## 1 S12000050       2013 95% Upper ~ Perc~   2.6 Don't Know       35-6~ All   
## 2 S92000003       2018 95% Lower ~ Perc~   1.7 Don't Know       All   All   
## 3 S12000013       2017 95% Lower ~ Perc~   0   Don't Know       35-6~ All   
## 4 S12000041       2018 95% Upper ~ Perc~   3.6 Don't Know       All   Female
## 5 S12000028       2014 Percent     Perc~  73   A 5 minute walk~ All   Male  
## 6 S12000014       2013 95% Lower ~ Perc~   0   Don't Know       35-6~ All   
## # ... with 5 more variables: `Urban Rural Classification` <chr>, `SIMD
## #   quintiles` <chr>, `Type Of Tenure` <chr>, `Household Type` <chr>,
## #   Ethnicity <chr>
#top N number of rows but not really useful with ties
top_n(wellbeing,10)
## # A tibble: 564 x 10
##    FeatureCode DateCode Measurement Units Value `Type Of Tenure`
##    <chr>       <chr>    <chr>       <chr> <dbl> <chr>           
##  1 S08000016   2017     Lower 95% ~ SWEM~  24.4 All             
##  2 S08000031   2014     Upper 95% ~ SWEM~  25.2 All             
##  3 S12000030   2015     Upper 95% ~ SWEM~  25.4 All             
##  4 S12000029   2017     Lower 95% ~ SWEM~  25.1 All             
##  5 S08000031   2016     Lower 95% ~ SWEM~  24.7 All             
##  6 S12000018   2017     Upper 95% ~ SWEM~  24.8 All             
##  7 S12000027   2015     Mean        SWEM~  25.7 All             
##  8 S12000045   2017     Mean        SWEM~  24.9 All             
##  9 S12000017   2015     Mean        SWEM~  25.0 All             
## 10 S08000030   2016     Upper 95% ~ SWEM~  25.1 All             
## # ... with 554 more rows, and 4 more variables: `Household Type` <chr>,
## #   Age <chr>, Gender <chr>, `Limiting Long-term Physical or Mental Health
## #   Condition` <chr>
glimpse(geo_area)
## Rows: 532
## Columns: 2
## $ FeatureCode <chr> "S92000003", "S12000033", "S12000034", "S12000041", "S1...
## $ GeoArea     <chr> "Scotland", "Aberdeen City", "Aberdeenshire", "Angus", ...

Tidying up

So let’s tidy the data, because neither of the datasets are particularly easy to read or work with. This will also demonstrate the ‘AND THEN’ philosophy.

green_blue_tidy <- green_blue %>% # pick the dataset AND THEN
  filter(Gender == 'All',
         `SIMD quintiles` == 'All',
         `Urban Rural Classification` == 'All', 
         Age != 'All', 
         Measurement == 'Percent',
         `Distance to Nearest Green or Blue Space` != "Don't Know") %>% #filter in only useful and relevant data AND THEN
  inner_join(geo_area, by = 'FeatureCode') %>% #get names of the Scottish Councils AND THEN
  select(Value, Age, `Distance to Nearest Green or Blue Space`, DateCode, GeoArea, FeatureCode) %>% #keep only relevant variables AND THEN etc...
  rename(Percent_population = Value)

knitr::kable(green_blue_tidy[1:5,])
Percent_population Age Distance to Nearest Green or Blue Space DateCode GeoArea FeatureCode
68 65 years and over A 5 minute walk or less 2014 South Ayrshire S12000028
80 35-64 years A 5 minute walk or less 2017 Shetland Islands S12000027
66 16-34 years A 5 minute walk or less 2014 South Lanarkshire S12000029
90 16-34 years A 5 minute walk or less 2017 Shetland Islands S12000027
79 35-64 years A 5 minute walk or less 2015 Shetland Islands S12000027

Now let’s wrangle the wellbeing dataset to a similar format so we can play with it. Wrangling follows pretty much the same format almost all of the time. This is what makes it so easy for students to understand.

#wellbeing
wellbeing_tidy <- wellbeing %>% 
  filter(Gender == 'All', 
         !Age %in% c('All', '16-64 years'), 
         `Type Of Tenure` == 'All',
         !`DateCode` %in% c('2014-2015', '2014-2017', '2016-2017'),
         Measurement == 'Mean') %>% 
  inner_join(geo_area, by = 'FeatureCode') %>% 
  select(Value, Age, DateCode, GeoArea, FeatureCode) %>% 
  mutate_at("DateCode", as.numeric) %>% 
  rename(Score= Value) 

Of course you can do a lot more with R, but if your wrangling is straight forward then the data analysis later is easier.

GGPLOT2: for your pretty plot

Now let’s see how plotting works:

ggplot2 comes with tidyverse and makes lovely plots. It is the same layering pattern.

ggplot(filter(green_blue_tidy, GeoArea %in% c('Glasgow City', 'City of Edinburgh' , 'Highland')),
       aes(x = (DateCode-2000), y = Percent_population)) +  
  geom_point(aes(colour = `Distance to Nearest Green or Blue Space`), size = 2) +
  facet_grid(GeoArea~Age) +
  scale_x_continuous(breaks = c(13:18)) +
  scale_colour_discrete(name="Distance to Green/Blue Space",
                         breaks=c("A 5 minute walk or less", "Within a 6-10 minute walk", "An 11 minute walk or more"),
                         labels=c("5 min or less", "6-10 min", "11 min or more")) +
  theme_bw() +
  labs(y = "Percentage", x = "Year") +
  theme(axis.title = element_text(size = 12, colour = 'blue'),
        axis.text = element_text(size = 10), 
        strip.text = element_text(size = 11),
        legend.position = 'top',
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12))

ggplot(filter(wellbeing_tidy, GeoArea %in% c('Glasgow City', 'City of Edinburgh' , 'Highland')),
       aes(x = (DateCode-2000), y = Score)) +  
  geom_smooth(method = 'lm', se = FALSE) +
  geom_point(size = 2) +
  facet_grid(GeoArea~Age) +
  theme_bw() +
  labs(y = "Mental Wellbeing Score", x = "Year") +
  scale_y_continuous(breaks = c(23:27),limits = c(23,27))+
  scale_x_continuous(breaks = c(13:17)) +
  theme(axis.title = element_text(size = 12, colour = 'blue'),
        axis.text = element_text(size = 10), 
        strip.text = element_text(size = 11))

The statistics

Lets remove overall Scotland data first and add proportion of people living near a park or the sea.

Note: Data is not available for every area, every year for both datasets

wellbeing_bg <- filter(wellbeing_tidy, GeoArea != 'Scotland') %>% 
 inner_join(green_blue_tidy, by = c('Age', 'FeatureCode', 'DateCode', 'GeoArea')) %>% 
  filter(`Distance to Nearest Green or Blue Space` == 'A 5 minute walk or less') %>% 
  select(-`Distance to Nearest Green or Blue Space`) %>% 
  mutate_at(c('Score', 'Percent_population'), scale)


summary(lm(Score ~ DateCode*Percent_population, data = wellbeing_bg))
## 
## Call:
## lm(formula = Score ~ DateCode * Percent_population, data = wellbeing_bg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5932 -0.6531 -0.0632  0.6044  4.2773 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                 234.78366  100.20544   2.343   0.0198 *
## DateCode                     -0.11649    0.04972  -2.343   0.0198 *
## Percent_population          136.00528  100.48591   1.353   0.1769  
## DateCode:Percent_population  -0.06742    0.04986  -1.352   0.1773  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9827 on 309 degrees of freedom
## Multiple R-squared:  0.0436, Adjusted R-squared:  0.03432 
## F-statistic: 4.696 on 3 and 309 DF,  p-value: 0.003194
#and in ANOVA terms
#Type-III variance decomposition - Simultaneous testing of effects

car::Anova(lm(Score ~ DateCode*Percent_population, data = wellbeing_bg), type = 'III') 
## Anova Table (Type III tests)
## 
## Response: Score
##                              Sum Sq  Df F value  Pr(>F)  
## (Intercept)                   5.301   1  5.4898 0.01976 *
## DateCode                      5.302   1  5.4902 0.01976 *
## Percent_population            1.769   1  1.8319 0.17689  
## DateCode:Percent_population   1.766   1  1.8285 0.17730  
## Residuals                   298.397 309                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yikes! We need more yoga!

Plot it on a map

Is wellbeing location based?

Let’s make a map of Scotland and see how wellbeing changes through the years.

library(tmap)
library(tmaptools)
#get shape file
scot <-  sf::read_sf("SG_SIMD_2016.shp") #data: https://www2.gov.scot/Topics/Statistics/SIMD

#add data we have to shape file
scot_wellbeing <- left_join(scot, filter(wellbeing_tidy, Age == '35-64 years'), by = c("LAName"='GeoArea'))

well <- tm_shape(scot_wellbeing) + # Layer 1. what is our shape
  tm_fill(col = 'Score', # Layer2 : fill or colour based on the Attending Education column
          palette = 'div')+ # choose a palette
  tm_style("cobalt") +
  tm_layout(legend.width = 2) +
  tm_facets(along = 'DateCode', free.scales.symbol.size = FALSE, nrow=1,ncol=1) # choose a facet
tmap_animation(well, filename = "well_anim.gif", delay = 80, loop = TRUE, width=1200)

Greenhouse gases and birds

There are no limitations with plots

Do you want to make a plot that shows you are a fan of birds?

Get an image of the internet and use it to plot bird feathers! There is a package for that: ggimage

library(ggimage)
#more data from statistics.gov.scot
greenhouse_gas <- read_csv('Greenhouse Gas emissions by source sector.csv')
birds <- read_csv('Terestrial breeding birds.csv')

#greenhouse gas

greenhouse_gas_tidy <- greenhouse_gas %>% # select the data AND THEN
  filter(Pollutant == 'CO2') %>% 
  inner_join(geo_area, by = 'FeatureCode') %>% 
  select(Value, `Greenhouse gas source sector`, DateCode, GeoArea) %>% 
  rename(Co2_count= Value) %>% 
  arrange(DateCode)

birds_tidy <- birds %>% 
   select(Value, DateCode) %>% 
  rename(count_base1994 = Value) %>% 
  arrange(DateCode)

green_bird <- inner_join(birds_tidy, greenhouse_gas_tidy, by = 'DateCode') %>% 
  mutate(img = 'feather.png')



ggplot(filter(green_bird, DateCode %in% c(2005:2014))) + 
  geom_col(aes(x = as.character(DateCode), y= Co2_count, fill = `Greenhouse gas source sector`)) +
  geom_image(aes(as.character(DateCode), y = count_base1994, image = img)) +
  theme_bw()+
  labs(y = "CO2 Emissions/Terestrial bird population", x = "Year") +
  theme(axis.title.x = element_text(size = 12, colour = 'blue'),
        axis.text = element_text(size = 10), 
        legend.text = element_text(size = 11),
        legend.title = element_text(size = 12),
        legend.position = 'bottom')

The End

meme::meme('success-kid-300x199.jpg', "Learned R", "Used it every day!", color = 'blue', size = 7, vjust = 0.2)

Package references

Fox J. and Weisberg S. (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/

Pebesma, E., (2018). Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal, 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009

Tennekes M (2018). “tmap: Thematic Maps in R.” Journal of Statistical Software, 84(6), 1-39. doi: 10.18637/jss.v084.i06 (URL: https://doi.org/10.18637/jss.v084.i06).

Tennekes M (2020). tmaptools: Thematic Map Tools. R package version 3.0. https://CRAN.R-project.org/package=tmaptools

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.

Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

Yu, G. (2020). ggimage: Use Image in ‘ggplot2’. R package version 0.2.8. https://CRAN.R-project.org/package=ggimage

Yu G. (2019). meme: Create Meme. R package version 0.2.2. https://CRAN.R-project.org/package=meme